5 research outputs found

    State-of-the-art web data extraction systems for online business intelligence

    The success of a company hinges on identifying and responding to competitive pressures. The main objective of online business intelligence is to collect valuable information from many Web sources to support decision making and thus gain competitive advantage. However, online business intelligence presents non-trivial challenges to Web data extraction systems, which must deal with technologically sophisticated modern Web pages where traditional manual programming approaches often fail. In this paper, we review commercially available state-of-the-art Web data extraction systems and their technological advances in the context of online business intelligence.
    Keywords: online business intelligence, Web data extraction, Web scraping
    Summary (translated from Lithuanian; the article itself is in English). Authors: Tomas Grigalis, Antanas Čenys. The success of a modern business organization depends on its ability to respond appropriately to a constantly changing competitive environment. The main purpose of an online business intelligence system is to collect valuable information from many different online sources and thereby help the organization make sound decisions and gain a competitive advantage. However, collecting information from online sources is a difficult problem when the collecting systems must work well with technologically very complex web pages. This article reviews the most advanced web data collection systems in the context of business intelligence. It also presents concrete scenarios in which data collection systems can support business intelligence. At the end of the article, the authors discuss technological achievements of recent years that have the potential to become fully automatic web data collection systems, further improve business intelligence, and considerably reduce its costs.

    The impact of cultural differences on perceived website quality

    The object of this master's thesis is websites, and the goal is to explore how cultural differences influence the perception of website quality. This goal is achieved by completing the following tasks: to explore the concept of culture; to find the relationship between the individual and the culture and the impact of culture on individual perception; to explore Internet and culture-related studies and find the models of cultural dimensions most commonly used to analyze websites across cultures; to examine the impact of cultural differences on web content; and to investigate whether there is a significant difference between the perception of website quality from distant and nearby cultures. The paper analyses scientific literature as well as quantitative and qualitative data collected through an online questionnaire survey. The analysis of scientific literature showed that different researchers perceive the concept of culture in their own ways. However, all authors commonly perceive culture as a way of thinking shared by a group of people. The scientific literature dealing with culture and websites usually uses the cultural dimension models of Hall (1997), Hofstede (2001), and Trompenaars and Hampden-Turner (1997). These models allow each culture to be defined on certain groups of scales, and thus reveal the essential differences between cultures. Culture forms the mindset of the individual and affects his or her perception of websites. Thus web users from different cultures have different perceptions of the importance and significance of various website elements. Websites contain culture-specific elements and place different emphasis on certain information, reflecting the culture. Furthermore, sites from different cultures may vary in the use of color, the specific use of symbols, information classification methods, etc. A culturally congruent website therefore reduces the cognitive effort needed to understand web content and creates an atmosphere in which communication messages are clearer and interaction is better. The empirical examination showed that cultural differences have a statistically significant (p < 0.001) impact on perceived website quality. Thus, the hypothesis was confirmed, meaning that culturally congruent websites are perceived as being of better quality.

    Structured data extraction from template-generated web pages (Struktūrizuotų duomenų išgavimas iš tinklalapių sugeneruotų pagal šablonus)

    Most of the structured data on the Web is found in database-backed websites. Typically, upon a web page request in such a site, structured data is retrieved from an underlying database and embedded into a web page using some fixed template. This dissertation studies the reverse-engineering task: extracting structured data from template-generated web pages. Web pages differ widely in visual style and underlying structure. Automatically extracting structured data from many structurally heterogeneous template-generated web pages is a difficult and time-consuming task, and it is regarded as a grand challenge. It is assumed that solving the challenge would improve today's Web search and help companies to reduce costs. Thus the main goal of the dissertation is to propose a novel and more effective method for extracting structured data from template-generated web pages. The object of the research in this dissertation is structured data extraction from template-generated web pages. The dissertation consists of an introduction, four main chapters and general conclusions. In the first chapter the problem of structured web data extraction is introduced, state-of-the-art data extraction techniques are reviewed, and finally real-life applications for structured web data extraction systems are discussed. In the second chapter a novel method for extracting structured data records from template-generated web pages is presented. The method is based on clustering visually and structurally similar web page elements. It first renders a given web page in a contemporary web browser, and then clusters visually and structurally similar repeating web page elements to identify an underlying pattern of embedded structured data records. Finally, a data-extracting wrapper is generated. The wrapper consists of XPath expressions that can be easily reused in many third-party data extraction applications. In the third chapter a novel method for structurally clustering template-generated web pages is proposed. The method is based on three observations: that there is a limited number of different style templates in one particular template-generated website; that there is a limited number of inner-site link locations in all templates of the same site; and that each individual location in a web page containing a link usually points to structurally similar web pages. The method leverages XPath locations of inbound inner-site links to significantly reduce web page clustering time. In the fourth and final chapter more than one million web pages are used to experimentally evaluate the two proposed methods. The results reveal that both proposed methods consistently outperform other state-of-the-art techniques.
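To illustrate what a wrapper made of XPath expressions might look like in practice, here is a minimal Python sketch. It is not the dissertation's method: the page fragment, the `WRAPPER` dictionary layout, and the `apply_wrapper` helper are all hypothetical, and the example relies on the limited XPath subset of the standard library's `xml.etree.ElementTree`, so the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-generated fragment (well-formed so the stdlib parser accepts it).
PAGE = """
<html><body>
  <div class="item"><span class="title">Laptop</span><span class="price">999</span></div>
  <div class="item"><span class="title">Phone</span><span class="price">499</span></div>
  <div class="ad">Buy now</div>
</body></html>
"""

# Hypothetical wrapper: one XPath locating each record,
# plus one relative XPath per field inside a record.
WRAPPER = {
    "record": ".//div[@class='item']",
    "fields": {
        "title": "./span[@class='title']",
        "price": "./span[@class='price']",
    },
}


def apply_wrapper(page_source, wrapper):
    """Return one dict per record matched by the wrapper's XPath expressions."""
    root = ET.fromstring(page_source.strip())
    records = []
    for node in root.findall(wrapper["record"]):
        record = {}
        for name, xpath in wrapper["fields"].items():
            field = node.find(xpath)
            # A missing field is recorded as None rather than raising.
            record[name] = field.text if field is not None else None
        records.append(record)
    return records


records = apply_wrapper(PAGE, WRAPPER)
```

Because the wrapper is plain data (strings holding XPath expressions), the same expressions could in principle be handed to any other XPath-capable tool, which is the reuse property the abstract points to.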

    Using XPaths of inbound links to cluster template-generated web pages

    Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages generated from a particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world websites in a few minutes and achieving > 90% accuracy.
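The idea of grouping pages by the XPath address of the links that point to them can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' algorithm: the `SITE` data is invented, XPaths are reduced to index-free tag paths, any `href` starting with `/` is treated as an inner-site link, and target pages are grouped only when the inbound link paths match exactly.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mini-site: URL -> well-formed page source.
SITE = {
    "/": """<html><body>
        <ul><li><a href="/cat/1">A</a></li><li><a href="/cat/2">B</a></li></ul>
    </body></html>""",
    "/cat/1": """<html><body>
        <div><p><a href="/item/1">x</a></p><p><a href="/item/2">y</a></p></div>
    </body></html>""",
    "/cat/2": """<html><body>
        <div><p><a href="/item/3">z</a></p></div>
    </body></html>""",
}


def link_xpaths(source):
    """Yield (href, xpath) for every link, using index-free tag paths."""
    root = ET.fromstring(source.strip())
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        if node.tag == "a" and "href" in node.attrib:
            yield node.attrib["href"], path
        for child in node:
            stack.append((child, path + "/" + child.tag))


def cluster_by_inbound_xpath(site):
    """Group target URLs by the XPath of the link that points to them."""
    clusters = defaultdict(set)
    for source in site.values():
        for href, xpath in link_xpaths(source):
            if href.startswith("/"):  # assumption: inner-site links are root-relative
                clusters[xpath].add(href)
    return {xpath: sorted(urls) for xpath, urls in clusters.items()}


clusters = cluster_by_inbound_xpath(SITE)
```

In this toy site the category pages end up in one group and the item pages in another without the target pages ever being parsed or compared, which hints at why link-location clustering can be cheap relative to pairwise structural comparison.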

    Analysis of automated modern web crawling and testing tools and their possible employment for information extraction

    The World Wide Web has become an enormous repository of data. Extracting, integrating and reusing this kind of data has a wide range of applications, including meta-searching, comparison shopping, business intelligence tools and security analysis of information in websites. However, reaching information in modern WEB 2.0 web pages is a difficult task, because the HTML tree is often dynamically modified by JavaScript code, new data are added by asynchronous requests to the web server, and elements are positioned with the help of cascading style sheets. The article reviews automated web testing tools for information extraction tasks. Article in Lithuanian.
    Original title: Šiuolaikinių tinklalapių automatizuotam naršymui ir testavimui skirtų priemonių analizė ir pritaikomumas informacijai rinkti
    Summary (translated from Lithuanian). As the Internet has become a huge information database, a problem of information gathering arises: how to choose, from an extremely large number of information sources, one that can provide the user with suitable, relevant information of interest. It is also important to be able to analyze modern web pages from a security standpoint and to search them for, for example, embedded hidden malicious code, which can be done only after collecting information from the web page. In addition, the new WEB 2.0 generation of the Internet forces a change in conventional information-gathering methods, because Flash, JavaScript, Ajax and other new technologies prevent information from being collected by analyzing ordinary HTML code alone. This article analyzes tools for automating the browsing and testing of complex modern web pages that can be used for gathering information.
    Keywords: information gathering, dynamic web pages, automated browsing, Quick Test Pro, Sahi, Selenium, Telerik, TestComplete, Watir, Windmill